Setting Up

To begin the scrape, I first count the number of tiles per page on Zillow, as well as the total number of houses, apartments, and townhomes available for rent in Portland, so I can check that the scrape returns as much data as it should. There are 978 locations available for rent in Portland as of format(Sys.time(), "%A %b %d, %Y").

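library(tidyverse) # stringr, purrr, dplyr, tibble, and the %>% pipe
library(rvest)     # read_html, html_nodes, html_node, html_text, html_attr

zillow_url <- "https://www.zillow.com/portland-or/rentals/"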

numberoftiles <- 40 # Number of listings per page

zillow_pg <- read_html(zillow_url) # Read in page

# count the total results
zillow_cnt <- zillow_pg %>%
  html_nodes(".result-count") %>%
  html_text() %>%
  str_remove(" results") %>%
  as.numeric()

# Set the page count
zillow_pg_cnt <- ceiling(zillow_cnt / numberoftiles)


zillow_cnt
## [1] 983
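
The page-count arithmetic can be checked without touching the site. This is a minimal sketch on a made-up count string (the "1,204 results" text is an assumption, not something Zillow returned). It also strips the thousands separator, which as.numeric() on its own would turn into NA.

```r
# Hypothetical count text; the real page may format it differently
count_text <- "1,204 results"
tiles_per_page <- 40

# Drop everything that is not a digit, then convert to a number
total <- as.numeric(gsub("[^0-9]", "", count_text))

# Round up so a partly filled last page still counts as a page
pages <- ceiling(total / tiles_per_page)
pages
```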

Testing the Scrape

This was very handy, since I eventually realized that Zillow only returns about the first 10 cards from the .list-card-info CSS class on each page, meaning that ultimately I should only get about 200 results if everything scrapes correctly. This greatly restricts the analysis I wanted to do to determine the best rental available in Portland, but I proceeded anyway in the hope that I could figure out a way around it in the future (future Duncan never did).

read_html("https://www.zillow.com/portland-or/rentals/1_p/") %>%
    html_nodes(".list-card-info") %>%
    html_text()
##  [1] "17275 NW Viola St, Portland, OR 97229$2,995/mo5 bds3 ba2,671 sqft- House for rent"           
##  [2] "16 SW Porter St APT 302, Portland, OR 97201$825/moStudio1 ba203 sqft- Apartment for rent"    
##  [3] "5793 NW 174th Ave, Portland, OR 97229$2,000/mo3 bds2.5 ba1,400 sqft- House for rent"         
##  [4] "1410 N Webster St APT 3, Portland, OR 97217$1,195/moStudio1 ba450 sqft- Apartment for rent"  
##  [5] "1831 SE Hawthorne Blvd #306, Portland, OR 97214$975/moStudio1 ba400 sqft- Apartment for rent"
##  [6] "4046 SE Holgate Blvd, Portland, OR 97202$1,395/mo2 bds1 ba760 sqft- House for rent"          
##  [7] "Deveraux Glen | 10600 SW Taylor St, Portland, OR$1,370+ 1 bd$1,510+ 2 bds"                   
##  [8] "8054 SE Lambert St, Portland, OR 97206$1,600/mo2 bds1 ba975 sqft- Apartment for rent"        
##  [9] "3745 SW Water Ave #303, Portland, OR 97239$1,545/mo1 bd1 ba655 sqft- Apartment for rent"     
## [10] ""
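
One way to put a number on this check is to count how many cards come back with any text at all. A small sketch on a shortened, hand-copied version of the output above (the vector name cards is just for illustration):

```r
# Three entries copied from the output above, including the empty one
cards <- c(
  "4046 SE Holgate Blvd, Portland, OR 97202$1,395/mo2 bds1 ba760 sqft- House for rent",
  "Deveraux Glen | 10600 SW Taylor St, Portland, OR$1,370+ 1 bd$1,510+ 2 bds",
  ""
)

# nzchar() is TRUE for non-empty strings, so the sum is the usable card count
n_filled <- sum(nzchar(cards))
n_filled
```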

Performing the Scrape

Here I basically just used inspect element a bunch to find the relevant CSS selectors on the info cards in the site, which can be seen here. The basic format is: read the page with read_html, find the node with html_node, then use html_text or html_attr depending on whether I want text or an attribute. After getting everything I want (mostly price, bedrooms, and house area), I build one URL per page with sprintf and map over them with purrr::map.

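# 20 pages at roughly 10 scrapable cards each gives about 200 listings
links <- sprintf("https://www.zillow.com/portland-or/rentals/%d_p/", 1:20)
results <- map(links, ~ {
  # Selectors found with inspect element and http://selectorgadget.com/
  # Each listing is an <article> inside the .photo-cards list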
  houses <- read_html(.x) %>%
    html_nodes(".photo-cards li article")
  z_id <- houses %>%
    html_attr("id")
  
  zillow_link <- houses %>%
    html_node(".list-card-link") %>%
    html_attr("href")
  
  image <- houses %>%
    html_node(".list-card-top a img") %>%
    html_attr("src")
  
  address <- houses %>%
    html_node(".list-card-addr") %>%
    html_text()
  
  typeofsale <- houses %>%
    html_node(".list-card-statusText") %>%
    html_text()
  
  price <- houses %>%
    html_node(".list-card-price") %>%
    html_text() %>%
    readr::parse_number()
  
  params <- houses %>%
    html_node(".list-card-info") %>%
    html_text()
  # number of bedrooms ("1 bd" or "3 bds"; studios have no count and become NA)
  beds <- params %>%
    str_extract("\\d+(?=\\s*bds?)") %>%
    as.numeric()
  # number of bathrooms, keeping half baths such as "2.5 ba"
  baths <- params %>%
    str_extract("\\d+(?:\\.\\d+)?(?=\\s*ba)") %>%
    as.numeric()
  # total square footage, with thousands separators stripped
  house_a <- params %>%
    str_extract("[0-9,]+(?=\\s*sqft)") %>%
    str_remove_all(",") %>%
    as.numeric()
  
  tibble(
    address = address,
    price = price,
    beds = beds,
    baths = baths,
    house_area = house_a,
    type = typeofsale,
    zillow_link = zillow_link,
    image = image
  )
  
}
) %>%
  bind_rows(.id = 'page_no')
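
The scrape leans on html_node (singular) returning exactly one result per card, with NA where a card lacks the element, so every column ends up the same length for the tibble. A minimal sketch of that behavior on made-up markup (the HTML snippet and ids below are assumptions for illustration, not Zillow's actual page):

```r
library(rvest)

# Made-up markup: the second card has no price element
page <- read_html('
  <ul class="photo-cards">
    <li><article id="a1"><div class="list-card-price">$1,395/mo</div></article></li>
    <li><article id="a2"></article></li>
  </ul>')

houses <- html_nodes(page, ".photo-cards li article")

# html_node() gives one result per article, NA where nothing matches
per_card <- houses %>% html_node(".list-card-price") %>% html_text()

# html_nodes() gives only the matches, so the vector comes back shorter
all_matches <- houses %>% html_nodes(".list-card-price") %>% html_text()

length(per_card)
length(all_matches)
```

With html_nodes the price vector would silently fall out of step with the addresses, which is why the singular form is the safer choice inside the map.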

Data Cleaning

Most of this is done above: in specific cases I use str_extract with regex patterns tailored to the card text.
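
Lookahead patterns like these are easy to test on a couple of card strings before running the full scrape. The sketch below uses two strings copied from the test output earlier. It shows that a digits-only pattern picks up only the "5" in "2.5 ba", while a decimal-aware variant (my assumption about what is wanted) keeps 2.5.

```r
library(stringr)

cards <- c(
  "5793 NW 174th Ave, Portland, OR 97229$2,000/mo3 bds2.5 ba1,400 sqft- House for rent",
  "3745 SW Water Ave #303, Portland, OR 97239$1,545/mo1 bd1 ba655 sqft- Apartment for rent"
)

# Digits-only lookahead: grabs just the "5" in "2.5 ba"
baths_digits <- as.numeric(str_extract(cards, "\\d+(?=\\s*ba)"))

# Decimal-aware variant: an optional ".digits" group keeps 2.5 intact
baths_decimal <- as.numeric(str_extract(cards, "\\d+(?:\\.\\d+)?(?=\\s*ba)"))

# "bds?" accepts both "1 bd" and "3 bds"
beds <- as.numeric(str_extract(cards, "\\d+(?=\\s*bds?)"))

# Square footage with the thousands separator removed
sqft <- as.numeric(str_remove_all(str_extract(cards, "[0-9,]+(?=\\s*sqft)"), ","))
```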

Data from scraping is often quite “messy”, but this honestly wasn’t too bad given the work I already did. I filter for at least 3 beds here for my personal analysis, and decided to make a shiny app at this point, which is pretty sick and you should check out here. For the app, and to make the links clickable in the gt table I build below, I also transform the link a bit.

at_least_three <- results %>%
  filter(beds > 2) %>%
  mutate(zillow_link = map(zillow_link, ~ htmltools::a(href = .x, "Link")),
         zillow_link = map(zillow_link, ~ gt::html(as.character(.x)))) %>%
  mutate(type = str_remove_all(type, "- "))
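
To see what the link transformation produces, here is a small sketch of the first step on a placeholder URL (not a real listing). htmltools::a builds an anchor tag and as.character turns it into an HTML string. Wrapping that string in gt::html then marks it as HTML so gt renders it as a link instead of escaping it.

```r
library(htmltools)

# Placeholder URL for illustration only
tag <- a(href = "https://www.zillow.com/", "Link")

# The tag object becomes a plain HTML string
link_html <- as.character(tag)
link_html
```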

Mapping the Data

To map the data I use tidygeocoder to find the latitude and longitude of each location from its address. I also add a column called color for use with the leaflet package, and then I make the map. Again, shoutout to my shiny app: it’s even better there, and the map interacts with the data in the table at a larger scale.

library(tidygeocoder)
library(leaflet)

map_at_least_three <- at_least_three %>% 
  mutate(color = case_when(type == "House for rent" ~ "red",
                           type == "Apartment for rent" ~ "blue",
                           type == "Townhouse for rent" ~ "green")) %>%
  mutate(address = str_remove_all(address, " UNIT ."),
         # "|" means alternation in a regex, so it is escaped to match literally
         address = str_remove_all(address, "The 1607 Apartments \\|\\s*")) %>%
  geocode(address, method = "arcgis", lat = latitude, long = longitude)

leaflet() %>%
  addTiles(group = "Detailed") %>%
  addProviderTiles("CartoDB.Positron", group = "Simple") %>%
  addAwesomeMarkers(lng = map_at_least_three$longitude, lat = map_at_least_three$latitude, 
                    popup = glue::glue("Address: {map_at_least_three$address}<br> Price: ${map_at_least_three$price}<br> Beds: {map_at_least_three$beds}<br> Type: {map_at_least_three$type}"), 
                    icon = awesomeIcons(icon='home', markerColor = map_at_least_three$color), 
                    group = "Subject Property") %>%
  addLegend("bottomright", labels = c("House for rent", "Apartment for rent", "Townhouse for rent"), 
            colors = c("red", "blue", "green"))
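
The address cleanup before geocoding could also be written more generally, so that any "Complex Name | " prefix is dropped rather than one named building. This is a sketch of that idea, not what the scrape above does. The first address is copied from the test output earlier, and the variable names are made up.

```r
library(stringr)

addresses <- c(
  "Deveraux Glen | 10600 SW Taylor St, Portland, OR",
  "4046 SE Holgate Blvd, Portland, OR 97202"
)

# Remove everything up to and including the last "|" plus any spaces after it.
# The pipe is escaped because an unescaped "|" means alternation in a regex.
clean <- str_remove(addresses, "^.*\\|\\s*")
clean
```

Addresses without a pipe pass through unchanged, so the step is safe to apply to every row.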

Tabulating the Data

This is me messing around with gt, since I love the package. There’s a lot to see in the code, but much of it is tiny twists on unnecessary details. The main thing that’s missing, only because it’s missing from the entire gt package, is pagination, which may someday be a thing?

library(gt) # tab_header, cols_label, data_color, and friends are called unprefixed below

col <- colorRampPalette(c("#3e9c35", "#084f09"))
at_least_three %>%
  select(-1) %>% # drop the page_no column
  mutate(`Price per Month` = price/3) %>% # monthly rent split three ways
  relocate(`Price per Month`, .after = price) %>%
  arrange(`Price per Month`) %>%
  gt::gt() %>%
      tab_header(title = html("<b><span style='color:#1f6632'>Best Portland Rentals</span></b>")) %>%
      cols_label(
        address = "Address",
        price = "Monthly Rent",
        beds = "Beds",
        baths = "Baths",
        house_area = "House Area",
        type = "Type",
        zillow_link = "Zillow Link",
        image = "Image"
      ) %>%
      data_color(
        columns = c(
          # `Price per Month`,
          `price`),
        colors = scales::col_numeric(c("#1B8366", "#57C478", "#FFDD0E", "#E9AE0B"), domain = NULL)
      ) %>%
      data_color(
        columns = house_area,
        colors = scales::col_numeric(
          palette = paletteer::paletteer_d(
            palette = "ggsci::red_material"
          ) %>% as.character(),
          domain = NULL
        )
      ) %>%
      # fmt_currency(columns = c(`Price per Month`)) %>%
      fmt_number(
        columns = c(price),
        pattern = "${x}",
        decimals = 0
      ) %>%
      fmt_number(
        columns = house_area,
        pattern = "{x} sq ft.",
        decimals = 0
      ) %>%
      gt::text_transform(
        locations = cells_body(columns = image),
        fn = function(x) {
          gt::web_image(
            url = x,
            height = 200
          )
        }
      ) %>%
      gt::text_transform(
        locations = cells_body(columns = type),
        fn = function(x) {
          str_remove_all(x, " for rent")
        }
      ) %>%
      gt::opt_table_font(
        font = list(
          gt::google_font("Open Sans"),
          gt::default_fonts()
        )
      ) %>%
      gt::tab_style(
        style = list(
          gt::cell_text(
            size = "medium",
            align = "center"
          )
        ),
        locations = gt::cells_body(
          columns = !c(address),
          rows = everything()
        )
      ) %>%
      cols_align(align = "center") %>%
      tab_style(
        style = cell_borders(
          sides = c("right"),
          color = "white",
          weight = px(1.5),
          style = "solid"
        ),
        locations = cells_body(
          columns = price
        )
      ) %>%
      DGThemes::gt_theme_duncan() %>%
      gt::tab_options(
        heading.border.bottom.color = "white",
        table.border.top.color = "white",
        column_labels.border.bottom.color = "white",
        column_labels.border.top.color = "white",
        table.border.left.color = "white",
        table.border.right.color = "white",
        table_body.hlines.color = "pink"
      )
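
The data_color calls hand gt a function built by scales::col_numeric. A short sketch of what that function does on its own, using the same four-color palette as the price column and three rents taken from the listings earlier (the variable names are just for illustration):

```r
library(scales)

# With domain = NULL the range is taken from whatever values are passed in
price_pal <- col_numeric(c("#1B8366", "#57C478", "#FFDD0E", "#E9AE0B"), domain = NULL)

# Sample monthly rents from the test output
rents <- c(1600, 2000, 2995)

# One hex color per value, running from the green end to the amber end
price_colors <- price_pal(rents)
price_colors
```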